Data Visualization and Machine Learning with R: Homework 4

Author

waterfirst

Published

July 27, 2024


1 Reproducing the Visualizations

Open the site below and reproduce each graph, one at a time, in RStudio.

[Korea R User Group – ChatGPT Data Visualization (r2bit.com)] https://r2bit.com/bitSlide/chatgpt_viz_202406.html#/데이터-시각화

library(tidyverse)
library(plotly)
library(gapminder)
library(crosstalk)
library(leaflet)
library(flipbookr)

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

1 Plot life expectancy by year for each country with ggplot
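
A minimal sketch of this task, assuming one line per country. Coloring by continent and the axis labels are choices made here for readability, not taken from the reference slides.

```r
library(ggplot2)
library(gapminder)

# One line per country; group = country keeps each country's points connected
p_life <- ggplot(gapminder,
                 aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(alpha = 0.5) +
  labs(x = "Year", y = "Life expectancy (years)", color = "Continent")

p_life
```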

2 Pass the ggplot to ggplotly to make an interactive graph

3 Add tooltips (text shown on mouse hover) to the interactive graph
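
One common way to control the hover text is to build it in an extra `text` aesthetic and tell `ggplotly()` to show only that. The sketch below assumes that approach; the tooltip wording and the object names `p_tip` and `p_tip_widget` are illustrative.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# `text` is not a standard ggplot2 aesthetic, so ggplot2 may warn
# "Ignoring unknown aesthetics: text"; ggplotly() still picks it up.
p_tip <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.3) +
  geom_point(
    aes(text = paste0("Country: ", country,
                      "<br>Life expectancy: ", round(lifeExp, 1))),
    size = 0.8
  )

# tooltip = "text" restricts the hover label to the string built above
p_tip_widget <- ggplotly(p_tip, tooltip = "text")
p_tip_widget
```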

4 Add a highlight feature (create a search box)
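
A possible sketch using plotly's `highlight_key()` and `highlight()`: the key marks which rows belong together (here, one country), and `selectize = TRUE` adds a search box for picking countries. The click and double-click events are choices made here, not necessarily those of the reference slides.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# Rows sharing the same country are highlighted together
keyed <- highlight_key(gapminder, ~country)

p_key <- ggplot(keyed, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.4)

# selectize = TRUE adds the search box above the plot
p_highlight <- highlight(
  ggplotly(p_key, tooltip = "country"),
  on = "plotly_click",
  off = "plotly_doubleclick",
  selectize = TRUE
)
p_highlight
```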

6 Linked views (zooming into one of several graphs zooms the others as well)

  • Double-click to return to the original scale
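
One way to get linked zooming is to stack plots with `subplot()` and share the x-axis, so that zooming along x in one panel applies to the other. The sketch below assumes that approach; the choice of country and of the two variables is illustrative.

```r
library(ggplot2)
library(plotly)
library(gapminder)

korea <- gapminder[gapminder$country == "Korea, Rep.", ]

p_top    <- ggplot(korea, aes(x = year, y = lifeExp)) + geom_line()
p_bottom <- ggplot(korea, aes(x = year, y = gdpPercap)) + geom_line()

# shareX = TRUE ties both panels to a single x-axis
linked <- subplot(ggplotly(p_top), ggplotly(p_bottom),
                  nrows = 2, shareX = TRUE, titleY = TRUE)
linked
```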

7 Plot GDP per capita (gdpPercap) against life expectancy (lifeExp) by continent and year

(use ggplotly)

8 Plot GDP per capita (gdpPercap) against life expectancy (lifeExp) by continent and year (animated)

(use geom_point(aes(frame = year)))
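
A minimal sketch of the animated version. `ggplotly()` turns the `frame` aesthetic into animation frames with a play button and slider. The log-scaled x-axis and the `ids` aesthetic (which helps points move smoothly between frames) are additions made here.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# frame and ids are plotly-specific aesthetics, so ggplot2 may warn
# that it ignores them; ggplotly() uses them to build the animation.
p_frames <- ggplot(gapminder,
                   aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(frame = year, ids = country), alpha = 0.6) +
  scale_x_log10()

p_frames_widget <- ggplotly(p_frames)
p_frames_widget
```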

9 Reproduce the graph (1)

Use theme(legend.position = "top", axis.text.x = element_text(angle = 90, hjust = 1))

10 Reproduce the graph (2)

Use facet_wrap(~country, scales = "free")
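
A short sketch combining free-scaled facets with the legend and axis-text theme settings. The three countries and the plotted variable are illustrative choices; the target graph in the reference slides may differ.

```r
library(ggplot2)
library(gapminder)

east_asia <- subset(gapminder, country %in% c("Korea, Rep.", "Japan", "China"))

p_facet <- ggplot(east_asia, aes(x = year, y = gdpPercap, color = country)) +
  geom_line() +
  # scales = "free" gives every panel its own x and y ranges
  facet_wrap(~country, scales = "free") +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1))

p_facet
```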

12 Implement an animation

Write the code by following along

Save it as a GIF file

library(gganimate)

life <- gapminder %>%
  filter(country %in% c("Korea, Rep.", "Korea, Dem. Rep.", "China", "United States", "Japan")) %>%
  ggplot(aes(x = year, y = lifeExp, group = country, col = country)) +
  geom_line(alpha = 0.3, linewidth = 1.5) +
  geom_point(size = 3.5) +
  theme_bw(base_family = "NanumGothic") +
  labs(x = "", y = "기대수명", color = "") +  # y label means "life expectancy"
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.text = element_text(size = 16, color = "black"),
        legend.text = element_text(size = 18),
        plot.title = element_text(size = 22)) +
  # gganimate >= 1.0 has no gganimate() function or frame aesthetic;
  # a transition_*() layer defines the animation instead
  transition_reveal(year)

# Render the animation (a GIF renderer such as the gifski package is needed)
life_anim <- animate(life, width = 640, height = 480)
life_anim

# Save the rendered animation as a GIF file
anim_save("기대수명.gif", animation = life_anim)

13 Interactive graphs

  • Body mass by penguin species
  • Penguin flipper length and body mass
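
The source does not name the penguin dataset. The sketch below assumes the `penguins` data from the palmerpenguins package and uses a box plot and a scatter plot as plausible readings of the two items.

```r
library(ggplot2)
library(plotly)
library(palmerpenguins)  # assumed data source

# Keep only the columns used here and drop rows with missing measurements
peng <- na.omit(penguins[, c("species", "flipper_length_mm", "body_mass_g")])

p_box <- ggplot(peng, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot()

p_scatter <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g,
                              color = species)) +
  geom_point()

ggplotly(p_box)
ggplotly(p_scatter)
```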

14 Practice with the ggiraph package

library(ggiraph)

dat <- gapminder::gapminder |> 
  janitor::clean_names() |> 
  mutate(
    # Reformat continent as a character instead of as a factor
    # (will be important later)
    id = levels(continent)[as.numeric(continent)],
    continent = forcats::fct_reorder(continent, life_exp)
  )

color_palette <- thematic::okabe_ito(5)
names(color_palette) <- unique(dat$continent)
base_size <- 18
mean_life_exps <- dat |> 
  group_by(continent, year, id) |> 
  summarise(mean_life_exp = mean(life_exp)) |> 
  ungroup()

line_chart <- mean_life_exps |> 
  ggplot(aes(x = year, y = mean_life_exp, col = continent)) +
  geom_line(linewidth = 2.5) +
  geom_point(size = 4) +
  theme_minimal(base_size = base_size) +
  labs(
    x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time'
  ) +
  theme(
    text = element_text(
      color = 'grey20'
    ),
    legend.position = 'none',
    panel.grid.minor = element_blank(),
    plot.title.position = 'plot'
  ) +
  scale_color_manual(values = color_palette)
line_chart

library(ggiraph)

line_chart <- mean_life_exps |> 
  ggplot(aes(x = year, y = mean_life_exp, col = continent, data_id = id)) +
  geom_line_interactive(linewidth = 2.5) +
  geom_point_interactive(size = 4) +
  theme_minimal(base_size = base_size) +
  labs(
    x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time'
  ) +
  theme(
    text = element_text(
      color = 'grey20'
    ),
    legend.position = 'none',
    panel.grid.minor = element_blank(),
    plot.title.position = 'plot'
  ) +
  scale_color_manual(values = color_palette)

girafe(ggobj = line_chart)
girafe(
  ggobj = line_chart,
  options = list(
    opts_hover(css = ''), ## CSS code of line we're hovering over
    opts_hover_inv(css = "opacity:0.1;"), ## CSS code of all other lines
    opts_sizing(rescale = FALSE) ## Fixes sizes to dimensions below
  ),
  height_svg = 6,
  width_svg = 9
)

2 Machine Learning (Regression)

Download the used-car price data from Kaggle (below) and predict used-car prices.

# Load libraries
library(dplyr)
library(caret)
library(ModelMetrics)
library(rpart)
library(randomForest)
# Load data
X_test <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/X_test.csv", stringsAsFactors = TRUE)
X_train <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/X_train.csv", stringsAsFactors = TRUE)
y_train <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/y_train.csv", stringsAsFactors = TRUE)
# Note: this line reads y_train.csv, so y_test is a copy of y_train, not held-out targets
y_test <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/y_train.csv", stringsAsFactors = TRUE)

# Join targets and features on the shared carID column
train <- inner_join(y_train, X_train, by = "carID")
head(train)
  carID price    brand     model year transmission mileage fuelType tax  mpg
1 13207 31995   hyundi  Santa Fe 2019    Semi-Auto    4223   Diesel 145 39.8
2 17314  7700 vauxhall       GTC 2015       Manual   47870   Diesel 125 60.1
3 12342 58990     audi       RS4 2019    Automatic    5151   Petrol 145 29.1
4 13426 12999       vw  Scirocco 2016    Automatic   20423   Diesel  30 57.6
5 16004 16990    skoda     Scala 2020    Semi-Auto    3569   Petrol 145 47.1
6 18964 40890     merc   V Class 2019    Automatic    4170   Diesel 145 44.1
  engineSize
1        2.2
2        2.0
3        2.9
4        2.0
5        1.0
6        2.1
str(train)
'data.frame':   4960 obs. of  11 variables:
 $ carID       : int  13207 17314 12342 13426 16004 18964 17053 19021 17429 16726 ...
 $ price       : int  31995 7700 58990 12999 16990 40890 25990 41980 25490 3491 ...
 $ brand       : Factor w/ 9 levels "audi","bmw","ford",..: 4 8 1 9 6 5 7 2 7 3 ...
 $ model       : Factor w/ 90 levels " 6 Series"," 7 Series",..: 69 35 63 71 70 80 55 51 16 45 ...
 $ year        : int  2019 2015 2019 2016 2020 2019 2020 2019 2019 2012 ...
 $ transmission: Factor w/ 4 levels "Automatic","Manual",..: 4 2 1 1 4 1 1 4 1 2 ...
 $ mileage     : int  4223 47870 5151 20423 3569 4170 3 101 6340 85843 ...
 $ fuelType    : Factor w/ 5 levels "Diesel","Electric",..: 1 1 5 1 5 1 3 5 3 5 ...
 $ tax         : num  145 125 145 30 145 145 135 145 135 30 ...
 $ mpg         : num  39.8 60.1 29.1 57.6 47.1 44.1 64.2 34 52.3 57.7 ...
 $ engineSize  : num  2.2 2 2.9 2 1 2.1 1.8 3 2.5 1.2 ...
str(X_test)
'data.frame':   2672 obs. of  10 variables:
 $ carID       : int  12000 12001 12004 12013 12017 12019 12020 12024 12027 12028 ...
 $ brand       : Factor w/ 9 levels "audi","bmw","ford",..: 5 9 5 6 1 8 5 5 7 9 ...
 $ model       : Factor w/ 89 levels " 6 Series"," 7 Series",..: 31 7 31 69 64 22 29 21 54 7 ...
 $ year        : int  2017 2017 2019 2019 2015 2020 2016 2015 2019 2020 ...
 $ transmission: Factor w/ 4 levels "Automatic","Manual",..: 1 1 1 2 4 4 2 4 1 1 ...
 $ mileage     : int  12046 37683 10000 3257 20982 1950 26809 33087 4793 5000 ...
 $ fuelType    : Factor w/ 5 levels "Diesel","Electric",..: 1 1 1 5 5 5 1 1 3 1 ...
 $ tax         : num  150 260 145 145 325 145 20 165 140 265 ...
 $ mpg         : num  37.2 36.2 34 49.6 29.4 40.4 67.3 51.4 64.2 33.6 ...
 $ engineSize  : num  3 3 3 1 4 1.2 2.1 3 1.8 3 ...
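  • Predict used-car prices with three machine learning models, then use the best model to predict y_test and save the result as a CSV file.

Preprocess the data: remove unnecessary columns, handle NAs, and check the target y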

       price        brand         year transmission      mileage     fuelType 
           0            0            0            0            0            0 
         tax          mpg   engineSize 
           0            0            0 
       brand         year transmission      mileage     fuelType          tax 
           0            0            0            0            0            0 
         mpg   engineSize 
           0            0 
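
The NA counts above no longer list carID or model, which suggests both columns were dropped. That is a reasonable step: carID is only an identifier, and model has 90 levels in train but 89 in X_test, more than randomForest accepts for a categorical predictor (53). A sketch of that step on a few hypothetical stand-in rows (`toy` is not the real data):

```r
library(dplyr)

# Hypothetical stand-in rows with a subset of the columns in `train`
toy <- data.frame(
  carID   = 1:4,
  price   = c(31995, 7700, 58990, 12999),
  brand   = factor(c("hyundi", "vauxhall", "audi", "vw")),
  model   = factor(c(" Santa Fe", " GTC", " RS4", " Scirocco")),
  year    = c(2019, 2015, 2019, 2016),
  mileage = c(4223, 47870, 5151, 20423)
)

# Drop the identifier and the high-cardinality model column
toy_clean <- toy |> select(-carID, -model)

# Count missing values per column
na_counts <- colSums(is.na(toy_clean))
na_counts
```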

Split into training and validation data

   price    brand year transmission mileage fuelType tax  mpg engineSize
1  31995   hyundi 2019    Semi-Auto    4223   Diesel 145 39.8        2.2
2   7700 vauxhall 2015       Manual   47870   Diesel 125 60.1        2.0
4  12999       vw 2016    Automatic   20423   Diesel  30 57.6        2.0
6  40890     merc 2019    Automatic    4170   Diesel 145 44.1        2.1
8  41980      bmw 2019    Semi-Auto     101   Petrol 145 34.0        3.0
11 13750 vauxhall 2016       Manual   40230   Diesel 160 50.4        1.6
   price  brand year transmission mileage fuelType tax  mpg engineSize
3  58990   audi 2019    Automatic    5151   Petrol 145 29.1        2.9
5  16990  skoda 2020    Semi-Auto    3569   Petrol 145 47.1        1.0
7  25990 toyota 2020    Automatic       3   Hybrid 135 64.2        1.8
9  25490 toyota 2019    Automatic    6340   Hybrid 135 52.3        2.5
10  3491   ford 2012       Manual   85843   Petrol  30 57.7        1.2
13 46050     vw 2019    Automatic    5000   Diesel 145 33.6        2.0
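
The model output further down reports 3473 training samples out of 4960 rows, roughly 70%. A sketch of such a split with `caret::createDataPartition()`, assuming p = 0.7; the seed and the simulated `toy` data are hypothetical.

```r
library(caret)

set.seed(123)  # hypothetical seed, only for reproducibility of this sketch
toy <- data.frame(
  price   = round(runif(200, 3000, 60000)),
  mileage = round(runif(200, 0, 90000))
)

# Sampling is stratified on quantile groups of the numeric target
idx <- createDataPartition(toy$price, p = 0.7, list = FALSE)

train_df <- toy[idx, ]   # about 70% of the rows
valid_df <- toy[-idx, ]  # the remaining rows, used for validation

c(train = nrow(train_df), valid = nrow(valid_df))
```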

Build the models (rpart, glm, randomForest)

CART 

3473 samples
   8 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 3473, 3473, 3473, 3473, 3473, 3473, ... 
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE      
  0.1015714  11374.56  0.5138065   7875.244
  0.1293171  12458.48  0.4181405   8767.174
  0.3473557  14623.91  0.3369765  10520.659

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.1015714.
Generalized Linear Model 

3473 samples
   8 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 3473, 3473, 3473, 3473, 3473, 3473, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  8880.687  0.7037728  5795.044

Call:
 randomForest(formula = price ~ ., data = train_df, ntree = 10) 
               Type of random forest: regression
                     Number of trees: 10
No. of variables tried at each split: 2

          Mean of squared residuals: 28715556
                    % Var explained: 89.36
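
The printed output is consistent with two models fitted through `caret::train()` (default bootstrap resampling, 25 repetitions) and a random forest fitted directly with `randomForest(ntree = 10)`. A sketch of those three calls on simulated stand-in data; the `toy` data frame, its price formula, and the seed are assumptions, not the assignment data.

```r
library(caret)
library(rpart)
library(randomForest)

set.seed(123)  # hypothetical seed
n <- 300
toy <- data.frame(
  year       = sample(2010:2020, n, replace = TRUE),
  mileage    = round(runif(n, 0, 90000)),
  engineSize = sample(c(1, 1.6, 2, 3), n, replace = TRUE)
)
# Simulated price so the models have a signal to learn
toy$price <- 2000 * (toy$year - 2009) - 0.1 * toy$mileage +
  5000 * toy$engineSize + rnorm(n, sd = 2000)

m_rpart <- train(price ~ ., data = toy, method = "rpart")  # decision tree (CART)
m_glm   <- train(price ~ ., data = toy, method = "glm")    # Gaussian GLM, i.e. linear regression
m_rf    <- randomForest(price ~ ., data = toy, ntree = 10) # random forest

m_rf
```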

Predict

Validate model performance (R2)

Decision tree model R2 =  0.5157428
Generalized linear model (linear regression) R2 =  0.7377449
Random forest R2 =  0.9345108
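
The source does not show how R2 was computed. One common definition is 1 minus the ratio of the residual sum of squares to the total sum of squares, sketched below with a hypothetical helper `r_squared()` and made-up numbers.

```r
# R2 = 1 - SS_res / SS_tot
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

r_squared(actual, predicted)  # 1 - 26/500 = 0.948
```

With real models, `predicted` would be the output of `predict()` on the validation rows and `actual` the validation prices.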

Fit the final model and predict


Call:
 randomForest(formula = price ~ ., data = train, ntree = 100) 
               Type of random forest: regression
                     Number of trees: 100
No. of variables tried at each split: 2

          Mean of squared residuals: 15599489
                    % Var explained: 94.2

Submit the data

         x
1 38996.07
2 24447.39
3 57229.13
4 16062.26
5 48071.38
6 21900.47
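
A sketch of the export step. Writing a bare numeric vector with `write.csv()` produces a single column named `x`, which matches the output above. The file name is hypothetical, and the three values are copied from that output for illustration.

```r
# First three predictions from the output above, used as stand-in values
pred <- c(38996.07, 24447.39, 57229.13)

# Hypothetical file name; tempdir() keeps the sketch free of side effects
out_file <- file.path(tempdir(), "submission.csv")

# A plain vector is converted to a one-column data frame named "x"
write.csv(pred, out_file, row.names = FALSE)

submitted <- read.csv(out_file)
head(submitted)
```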